Abstract: The key approaches for machine learning, especially learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of value updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of the eigen action is determined by the probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL such as convergence, optimality and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. The present work is also an effective exploration on the application of quantum computation to artificial intelligence.

Index Terms: quantum reinforcement learning, state superposition, collapse, probability amplitude, Grover iteration.

I. Introduction 

LEARNING methods are generally classified into super- 
vised, unsupervised and reinforcement learning (RL). 
Supervised learning requires explicit feedback provided by 
input-output pairs and gives a map from inputs to outputs.
Unsupervised learning only processes the input data. In
contrast, RL uses a scalar value named reward to evaluate
the input-output pairs and learns a mapping from states to 
actions by interaction with the environment through trial-and-error.
Since the 1980s, RL has become an important approach
to machine learning [1]-[22], and is widely used in artificial
intelligence, especially in robotics [7]-[10], [18], due to its 
good performance of on-line adaptation and powerful learning 
ability to complex nonlinear systems. However there are 
still some difficult problems in practical applications. One
problem is the exploration strategy, which contributes a lot to
better balancing of exploration (trying previously unexplored 
strategies to find better policy) and exploitation (taking the 
most advantage of the experienced knowledge). The other is 
its slow learning speed, especially for the complex problems 
sometimes known as "the curse of dimensionality" when the 
state-action space becomes huge and the number of parameters 
to be learned grows exponentially with its dimension. 

To combat those problems, many methods have been pro- 
posed in recent years. Temporal abstraction and decomposition 
methods have been explored to solve such problems as RL 
and dynamic programming (DP) to speed up learning [11]- 
[15]. Different kinds of learning paradigms are combined to 
optimize RL. For example, Smith [16] presented a new model
for representation and generalization in model-less RL based 
on the self-organizing map (SOM) and standard Q-learning. 
The adaptation of Watkins' Q-learning with fuzzy inference 
systems for problems with large state-action spaces or with 
continuous state spaces is also proposed [6], [17], [18], [19]. 
Many specific improvements are also implemented to modify 
related RL methods in practice [7], [9], [10], [20], [21], 
[22]. In spite of all these attempts more work is needed to 
achieve satisfactory successes and new ideas are necessary to 
explore more effective representation methods and learning 
mechanisms. In this paper, we explore how to overcome some
difficulties in RL using quantum theory and propose a novel 
quantum reinforcement learning method. 

Quantum information processing is a rapidly developing 
field. Some results have shown that quantum computation 
can more efficiently solve some difficult problems than the 
classical counterpart. Two important quantum algorithms, the 
Shor algorithm [23], [24] and the Grover algorithm [25], 
[26], were proposed in 1994 and 1996, respectively. The
Shor algorithm can give an exponential speedup for factoring 
large integers into prime numbers and it has been realized 
[27] for the factorization of integer 15 using nuclear mag- 
netic resonance (NMR). The Grover algorithm can achieve a 
quadratic speedup over classical algorithms in unsorted database
searching and its experimental implementations have also been 
demonstrated using NMR [28]-[30] and quantum optics [31], 
[32] for a system with four states. Some methods have also 
been explored to connect quantum computation and machine 
learning. For example, the quantum computing version of 
artificial neural network has been studied from the pure theory 
to the simple simulated and experimental implementation [33]- 
[37]. Rigatos and Tzafestas [38] used quantum computation 
for the parallelization of a fuzzy logic control algorithm to 
speed up the fuzzy inference. Quantum or quantum-inspired 
evolutionary algorithms have been proposed to improve the 
existing evolutionary algorithms [39]. Hogg and Portnov [40] 
presented a quantum algorithm for combinatorial optimization
of overconstrained satisfiability (SAT) and asymmetric
travelling salesman (ATSP) problems. Recently the quantum search
technique has been applied to dynamic programming [41]. Taking
advantage of quantum computation, some novel algorithms 
inspired by quantum characteristics will not only improve the 
performance of existing algorithms on traditional computers, 
but also promote the development of related research areas 
such as quantum computers and machine learning. Considering 
the essence of computation and algorithms, Dong and his co- 
workers [42] have presented the concept of quantum rein- 
forcement learning (QRL) inspired by the state superposition 
principle and quantum parallelism. Following this concept, we 
in this paper give a formal quantum reinforcement learning 
algorithm framework and specifically demonstrate the advan- 
tages of QRL for speeding up learning and obtaining a good 
tradeoff between exploration and exploitation of RL through 
simulated experiments and some related discussions. 

This paper is organized as follows. Section II contains the 
prerequisite and problem description of standard reinforcement 
learning, quantum computation and related quantum gates. 
In Section III, quantum reinforcement learning is introduced 
systematically, where the state (action) space is represented 
with the quantum state, the exploration strategy based on the 
collapse postulate is achieved and a novel QRL algorithm is 
proposed specifically. Section IV analyzes related character- 
istics of QRL such as the convergence, optimality and the 
balancing between exploration and exploitation. Section V 
describes the simulated experiments and the results demon- 
strate the effectiveness and superiority of QRL algorithm. In 
Section VI, we briefly discuss some related problems of QRL 
for future work. Concluding remarks are given in Section VII.



II. Prerequisite and problem description

In this section we first briefly review the standard reinforcement learning algorithms and then introduce the background of quantum computation and some related quantum gates.

A. Reinforcement learning (RL)

The standard framework of reinforcement learning is based on discrete-time, finite-state Markov decision processes (MDPs) [1].

Definition 1 (MDP): A Markov decision process (MDP) is composed of the following five factors: $\{S, A_{(i)}, p_{ij}(a), r(i,a), V,\ i,j \in S,\ a \in A_{(i)}\}$, where: $S$ is the state space; $A_{(i)}$ is the action space for state $i$; $p_{ij}(a)$ is the probability for state transition; $r$ is a reward function, $r : \Gamma \to (-\infty, +\infty)$, where $\Gamma = \{(i,a) \mid i \in S, a \in A_{(i)}\}$; $V$ is a criterion function or objective function.

According to the definition of MDP, we know that the MDP history is composed of successive states and decisions: $h_n = (s_0, a_0, s_1, a_1, \ldots, s_{n-1}, a_{n-1}, s_n)$. The policy $\pi$ is a sequence: $\pi = (\pi_0, \pi_1, \ldots)$; when the history at $n$ is $h_n$, the strategy is adopted to make a decision according to the probability distribution $\pi_n(\cdot \mid h_n)$ on $A_{(s_n)}$.

RL algorithms assume that the state $S$ and action $A_{(s_n)}$ can be divided into discrete values. At a certain step $t$, the agent observes the state of the environment (inside and outside of the agent) $s_t$, and then chooses an action $a_t$. After executing the action, the agent receives a reward $r_{t+1}$, which reflects how good that action is (in a short-term sense). The state of the environment will change to the next state $s_{t+1}$ under the action $a_t$. The agent will choose the next action $a_{t+1}$ according to related knowledge.

The goal of reinforcement learning is to learn a mapping from states to actions, that is to say, the agent is to learn a policy $\pi : S \times \cup_{i \in S} A_{(i)} \to [0,1]$, so that the expected sum of discounted reward of each state will be maximized:

$$V^{\pi}(s) = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, \pi\} = \sum_{a \in A_{(s)}} \pi(s,a)\Big[r_s^a + \gamma \sum_{s'} p_{ss'}^a V^{\pi}(s')\Big] \tag{1}$$

where $\gamma \in [0,1]$ is a discount factor, $\pi(s,a)$ is the probability of selecting action $a$ according to state $s$ under policy $\pi$, $p_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ is the probability for state transition and $r_s^a = E\{r_{t+1} \mid s_t = s, a_t = a\}$ is the expected one-step reward. $V^{\pi}(s)$ (or $V(s)$) is also called the value function of state $s$, and the temporal difference (TD) one-step updating rule of $V(s)$ may be described as

$$V(s) \leftarrow V(s) + \alpha\big(r + \gamma V(s') - V(s)\big) \tag{2}$$

where $\alpha \in (0,1)$ is the learning rate. We have the optimal state-value function

$$V^{*}(s) = \max_{a \in A_{(s)}} \Big[r_s^a + \gamma \sum_{s'} p_{ss'}^a V^{*}(s')\Big] \tag{3}$$

$$\pi^{*} = \arg\max_{\pi} V^{\pi}(s), \quad \forall s \in S \tag{4}$$

In dynamic programming, (3) is also called the Bellman equation of $V^{*}$.

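As for state-action pairs, there are similar value functions and Bellman equations, where $Q^{\pi}(s,a)$ stands for the value of taking the action $a$ in the state $s$ under the policy $\pi$:

$$\begin{aligned} Q^{\pi}(s,a) &= E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi\} \\ &= r_s^a + \gamma \sum_{s'} p_{ss'}^a V^{\pi}(s') \\ &= r_s^a + \gamma \sum_{s'} p_{ss'}^a \sum_{a'} \pi(s',a')\, Q^{\pi}(s',a') \end{aligned} \tag{5}$$

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) = r_s^a + \gamma \sum_{s'} p_{ss'}^a \max_{a'} Q^{*}(s',a') \tag{6}$$

Let $\alpha$ be the learning rate; the one-step updating rule of Q-learning (a widely used RL algorithm) [5] is:

$$Q(s_t,a_t) \leftarrow (1-\alpha)\,Q(s_t,a_t) + \alpha\big(r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a')\big) \tag{7}$$

Besides Q-learning, there are many other effective standard RL algorithms, for example TD($\lambda$), Sarsa, etc. For more details see [1].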
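As a small illustration of the updating rules (2) and (7), the following Python sketch applies one TD(0) step and one Q-learning step to tabular values. The function names, the list-based tables and the numerical values are assumptions made only for this example; they are not part of the original method.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One-step TD(0) update of the state value V[s], following rule (2)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V[s]


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning update of Q[s][a], following rule (7)."""
    best_next = max(Q[s_next])  # max over a' of Q(s_{t+1}, a')
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]


# Hypothetical two-state example.
V = [0.0, 0.0]
td0_update(V, s=0, r=1.0, s_next=1)                # 0 + 0.1 * (1 + 0.9*0 - 0)

Q = [[0.0, 0.0], [0.5, 2.0]]
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)    # 0.9*0 + 0.1 * (1 + 0.9*2.0)
```

Both rules move the stored estimate a fraction $\alpha$ of the way toward a one-step bootstrapped target; they differ only in whether the target uses $V(s')$ or $\max_{a'} Q(s', a')$.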
B. State superposition and quantum parallelism

Analogous to classical bits, the fundamental concept in quantum computation is the quantum bit (qubit). The two basic states for a qubit are denoted as $|0\rangle$ and $|1\rangle$, which correspond to the states 0 and 1 for a classical bit. However, besides $|0\rangle$ or $|1\rangle$, a qubit can also lie in the superposition state of $|0\rangle$ and $|1\rangle$. In other words, a qubit $|\psi\rangle$ can generally be expressed as a linear combination of $|0\rangle$ and $|1\rangle$

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle \tag{8}$$

where $\alpha$ and $\beta$ are complex coefficients. This special quantum phenomenon is called the state superposition principle, which is an important difference between classical computation and quantum computation [43].

The physical carrier of a qubit is any two-state quantum system such as a two-level atom, a spin-$\frac{1}{2}$ particle or a polarized photon. For a physical qubit, when we select a set of bases $|0\rangle$ and $|1\rangle$, we indicate that an observable $O$ of the qubit system has been chosen and the bases correspond to the two eigenvectors of $O$. For convenience, the measurement process on the observable $O$ of a quantum system in the corresponding state $|\psi\rangle$ is directly called a measurement of quantum state $|\psi\rangle$ in this paper. When we measure a qubit in superposition state $|\psi\rangle$, the qubit system collapses into one of its basic states $|0\rangle$ or $|1\rangle$. However, we cannot determine in advance whether it will collapse to state $|0\rangle$ or $|1\rangle$. We only know that we get this qubit in state $|0\rangle$ with probability $|\alpha|^2$, or in state $|1\rangle$ with probability $|\beta|^2$. Hence $\alpha$ and $\beta$ are generally called probability amplitudes. The magnitude and argument of a probability amplitude represent amplitude and phase, respectively. Since the sum of probabilities must be equal to 1, $\alpha$ and $\beta$ should satisfy $|\alpha|^2 + |\beta|^2 = 1$.
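The collapse described above can be imitated on a classical computer by sampling from the two measurement probabilities. The following is a minimal sketch under that assumption; `measure_qubit` and the chosen amplitudes are hypothetical names and values used only for illustration.

```python
import numpy as np


def measure_qubit(alpha, beta, rng):
    """Simulate measuring alpha|0> + beta|1>; return the observed basis state (0 or 1)."""
    p0 = abs(alpha) ** 2
    p1 = abs(beta) ** 2
    if not np.isclose(p0 + p1, 1.0):
        raise ValueError("amplitudes must satisfy |alpha|^2 + |beta|^2 = 1")
    return 0 if rng.random() < p0 else 1


rng = np.random.default_rng(0)
alpha, beta = np.sqrt(0.2), 1j * np.sqrt(0.8)   # complex amplitudes are allowed
samples = [measure_qubit(alpha, beta, rng) for _ in range(10_000)]
freq_one = sum(samples) / len(samples)          # relative frequency of outcome 1
```

The phase of $\beta$ does not influence the outcome statistics here; only the squared magnitudes do, so `freq_one` should lie close to $|\beta|^2 = 0.8$.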

According to quantum computation theory, a fundamental operation in the quantum computing process is a unitary transformation $U$ on the qubits. If one applies a transformation $U$ to a superposition state, the transformation will act on all basis vectors of this state and the output will be a new superposition state obtained by superposing the results of all basis vectors. It seems that the transformation $U$ can simultaneously evaluate the different values of a function $f(x)$ for a certain input $x$, and this is called quantum parallelism. Quantum parallelism is one of the most important factors behind the power of quantum algorithms. However, note that this parallelism is not immediately useful [44] since the direct measurement on the output generally gives only $f(x)$ for one value of $x$. Suppose the input qubit $|z\rangle$ lies in the superposition state:

$$|z\rangle = \alpha|0\rangle + \beta|1\rangle \tag{9}$$

The transformation $U_z$ which describes the computing process may be defined as follows:

$$U_z : |z, 0\rangle \to |z, f(z)\rangle \tag{10}$$

where $|z, 0\rangle$ represents the joint input state with the first qubit in $|z\rangle$ and the second qubit in $|0\rangle$, and $|z, f(z)\rangle$ is the joint output state with the first qubit in $|z\rangle$ and the second qubit in $|f(z)\rangle$. According to equations (9) and (10), we can easily obtain [44]:

$$U_z|z, 0\rangle = \alpha|0, f(0)\rangle + \beta|1, f(1)\rangle \tag{11}$$

The result contains information about both $f(0)$ and $f(1)$, and we seem to evaluate $f(z)$ for two values of $z$ simultaneously.

The above process corresponds to a "quantum black box" (or oracle). By feeding quantum superposition states to a quantum black box, we can learn what is inside with an exponential speedup, compared to how long it would take if we were only allowed classical inputs [43].

Now consider an $n$-qubit system, which can be represented with the tensor product of $n$ qubits:

$$|\phi\rangle = |\psi_1\rangle \otimes |\psi_2\rangle \otimes \cdots \otimes |\psi_n\rangle = \sum_{x=00\cdots0}^{11\cdots1} C_x |x\rangle \tag{12}$$

where '$\otimes$' means tensor product, $\sum_{x=00\cdots0}^{11\cdots1} |C_x|^2 = 1$, $C_x$ is a complex coefficient and $|C_x|^2$ represents the occurrence probability of $|x\rangle$ when state $|\phi\rangle$ is measured. $x$ can take on $2^n$ values, so the superposition state can be looked upon as the superposition of all integers from 0 to $2^n - 1$. Since $U$ is a unitary transformation, computing the function $f(x)$ gives [43]:

$$U \sum_{x=00\cdots0}^{11\cdots1} C_x |x, 0\rangle = \sum_{x=00\cdots0}^{11\cdots1} C_x U|x, 0\rangle = \sum_{x=00\cdots0}^{11\cdots1} C_x |x, f(x)\rangle \tag{13}$$

Based on the above analysis, it is easy to find that an $n$-qubit system can simultaneously process $2^n$ states, although only one of the $2^n$ states is accessible through a direct measurement and the ability is required to extract information about more than one value of $f(x)$ from the output superposition state [44]. This is different from classical parallel computation, where multiple circuits built to compute are executed simultaneously, since quantum parallelism doesn't necessarily make a tradeoff between computation time and needed physical space. In fact, quantum parallelism employs a single circuit to simultaneously evaluate the function for multiple values by exploiting the quantum state superposition principle and provides an exponential-scale computation space in the $n$-qubit linear physical space. Therefore quantum computation can effectively increase the computing speed of some important classical functions. So it is possible to obtain significant results by fusing quantum computation into reinforcement learning theory.
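To make (13) concrete, the sketch below simulates a small oracle classically with NumPy. It assumes the usual unitary extension $|x, y\rangle \to |x, y \oplus f(x)\rangle$ (which reduces to (10) for $y = 0$), a one-bit function `f`, and the index convention $2x + y$ for the basis state $|x, y\rangle$; all of these are choices made for this example only.

```python
import numpy as np


def oracle_matrix(f, n):
    """Permutation matrix for |x, y> -> |x, y XOR f(x)> with n input bits and 1 output bit."""
    dim = 2 ** (n + 1)
    U = np.zeros((dim, dim))
    for x in range(2 ** n):
        for y in (0, 1):
            src = 2 * x + y               # index of |x, y>
            dst = 2 * x + (y ^ f(x))      # index of |x, y XOR f(x)>
            U[dst, src] = 1.0
    return U


n = 2
f = lambda x: x % 2                           # hypothetical one-bit function
C = np.full(2 ** n, 1 / np.sqrt(2 ** n))      # equal amplitudes C_x
state_in = np.zeros(2 ** (n + 1))
state_in[0::2] = C                            # sum_x C_x |x, 0>
state_out = oracle_matrix(f, n) @ state_in    # sum_x C_x |x, f(x)>

# One application of U places amplitude on every pair (x, f(x)) ...
support = [(int(i) // 2, int(i) % 2) for i in np.flatnonzero(state_out)]

# ... but a single measurement reveals only one pair, with probability |C_x|^2.
rng = np.random.default_rng(1)
outcome = int(rng.choice(len(state_out), p=state_out ** 2))
observed = (outcome // 2, outcome % 2)
```

Note that this classical simulation stores all $2^{n+1}$ amplitudes explicitly, so it illustrates the bookkeeping of (13) rather than any speedup.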

C. Quantum Gates 

In classical computation, the logic operators that can complete some specific tasks are called logic gates, such as the NOT gate, AND gate, XOR gate, and so on. Analogously, quantum computing tasks can be completed through quantum gates. Nowadays some simple quantum gates such as the quantum NOT gate and the quantum CNOT gate have been built in quantum computation. Here we only introduce two important quantum gates, the Hadamard gate and the phase gate, which are closely related to accomplishing some quantum logic operations for the present quantum reinforcement learning. A detailed discussion of quantum gates can be found in Ref. [44].

The Hadamard gate (or Hadamard transform) is one of the most useful quantum gates and can be represented as [44]:

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \tag{14}$$

Through the Hadamard gate, a qubit in the state $|0\rangle$ is transformed into an equally weighted superposition state of $|0\rangle$ and $|1\rangle$, i.e.

$$H|0\rangle = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle \tag{15}$$

Similarly, a qubit in the state $|1\rangle$ is transformed into the superposition state $\frac{1}{\sqrt{2}}|0\rangle - \frac{1}{\sqrt{2}}|1\rangle$, i.e. the magnitude of the amplitude in each state is $\frac{1}{\sqrt{2}}$ but the phase of the amplitude in the state $|1\rangle$ is inverted. In classical probabilistic algorithms, the phase has no analog since the amplitudes are in general complex numbers in quantum mechanics.

The other related quantum gate is the phase gate (conditional phase shift operation), which is an important element to carry out the Grover iteration for reinforcing a "good" decision. According to quantum information theory, this transformation may be efficiently implemented on a quantum computer. For example, the transformation describing this for a two-state system is of the form:

$$U_{phase} = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\varphi} \end{pmatrix} \tag{16}$$

where $i = \sqrt{-1}$ and $\varphi$ is an arbitrary real number [26].
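The two gates can be checked numerically with a few lines of NumPy. The sketch below is only an illustration of (14) to (16) acting on amplitude vectors; the variable names and the choice $\varphi = \pi$ are assumptions of this example.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)      # Hadamard gate, eq. (14)


def phase_gate(phi):
    """Conditional phase shift of eq. (16): keeps |0>, multiplies |1> by e^{i phi}."""
    return np.array([[1, 0], [0, np.exp(1j * phi)]])


ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

plus = H @ ket0      # (|0> + |1>) / sqrt(2), eq. (15)
minus = H @ ket1     # (|0> - |1>) / sqrt(2): same magnitudes, inverted phase on |1>

# With phi = pi the phase gate flips the sign of the |1> amplitude; this maps
# plus to minus while leaving both measurement probabilities at 1/2.
flipped = phase_gate(np.pi) @ plus
probs = np.abs(flipped) ** 2
```

The last two lines show why phase matters even though it is invisible to a direct measurement: the probabilities are unchanged, yet the state is a different one, and a later Hadamard gate would send it to $|1\rangle$ instead of $|0\rangle$.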



III. Quantum reinforcement learning (QRL) 

Just like the traditional reinforcement learning, a quantum 
reinforcement learning system can also be characterized by three
main subelements: a policy, a reward function and a model 
of the environment (maybe not explicit). But quantum rein- 
forcement learning algorithms are remarkably different from 
all those traditional RL algorithms in the following intrinsic 
aspects: representation, policy, parallelism and updating oper- 
ation. 

A. Representation 

As we represent a QRL system with quantum concepts, 
similarly, we have the following definitions and propositions 
for quantum reinforcement learning. 

Definition 2 (Eigen states (or eigen actions)): Select an 
observable of a quantum system and its eigenvectors form a 
set of complete orthonormal bases in a Hilbert space. The 
states s (or actions a) in Definition 1 are denoted as the 
corresponding orthogonal bases and are called the eigen states 
or eigen actions in QRL. 

Remark 1: In the remainder of this paper, we indicate that an observable has been chosen but we do not present the observable specifically when mentioning a set of orthogonal bases. From Definition 2, we can get the set of eigen states: $S$, and that of eigen actions for state $i$: $A_{(i)}$. The eigen state (eigen action) in QRL corresponds to the state (action) in traditional RL. According to quantum mechanics, the quantum state for a general closed quantum system can be represented with a unit vector $|\psi\rangle$ (Dirac representation) in a Hilbert space. The inner product of $|\psi_1\rangle$ and $|\psi_2\rangle$ can be written as $\langle\psi_1|\psi_2\rangle$ and the normalization condition for $|\psi\rangle$ is $\langle\psi|\psi\rangle = 1$. As the simplest quantum mechanical system, the state of the qubit can be described as in (8), and its normalization condition is equivalent to $|\alpha|^2 + |\beta|^2 = 1$.

Remark 2: According to the superposition principle in quantum computation, since a quantum reinforcement learning system can lie in some orthogonal quantum states, which correspond to the eigen states (eigen actions), it can also lie in an arbitrary superposition state. That is to say, a QRL system which can take on the states (or actions) $|\psi_n\rangle$ is also able to occupy their linear superposition state (or action)

$$|\psi\rangle = \sum_n \beta_n |\psi_n\rangle \tag{17}$$

It is worth noting that this is only a representation method and our goal is to take advantage of the quantum characteristics in the learning process. In fact, the state (action) in QRL is not a practical state (action) and it is only an artificial state (action) for computing convenience with quantum systems. The practical state (action) is the eigen state (eigen action) in QRL. For an arbitrary state (or action) in a quantum reinforcement learning system, we can obtain Proposition 1.

Proposition 1: An arbitrary state $|S\rangle$ (or action $|A\rangle$) in QRL can be expanded in terms of an orthogonal set of eigen states $|s_n\rangle$ (or eigen actions $|a_n\rangle$), i.e.

$$|S\rangle = \sum_n \alpha_n |s_n\rangle \tag{18}$$

$$|A\rangle = \sum_n \beta_n |a_n\rangle \tag{19}$$

where $\alpha_n$ and $\beta_n$ are probability amplitudes, and satisfy $\sum_n |\alpha_n|^2 = 1$ and $\sum_n |\beta_n|^2 = 1$.

Remark 3: The states and actions in QRL are different from 
those in traditional RL: (1) The sum of several states (or 
actions) does not have a definite meaning in traditional RL, 
but the sum of states (or actions) in QRL is still a possible 
state (or action) of the same quantum system. (2) When $|S\rangle$
takes on an eigen state $|s_n\rangle$, it is exclusive. Otherwise, it has
the probability of $|\alpha_n|^2$ to be in the eigen state $|s_n\rangle$. The same
analysis also applies to the action $|A\rangle$.

Since quantum computation is built upon the concept of the qubit, as described in Section II, for the convenience of processing we use multiple-qubit systems to express states and actions and propose a formal representation of them for the QRL system. Let $N_s$ and $N_a$ be the number of states and actions; then choose numbers $m$ and $n$, which are characterized by the following inequalities:

$$N_s \le 2^m \le 2N_s, \qquad N_a \le 2^n \le 2N_a \tag{20}$$
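The inequalities in (20) are satisfied by taking the smallest number of qubits whose basis states cover all states (or actions). A minimal sketch follows; `qubits_needed` and the example counts are hypothetical and serve only as an illustration.

```python
def qubits_needed(count):
    """Smallest k with count <= 2**k; for count >= 1 this also gives 2**k <= 2*count."""
    if count < 1:
        raise ValueError("count must be a positive integer")
    return (count - 1).bit_length()


m = qubits_needed(20)   # hypothetical N_s = 20 states:  20 <= 2**5 = 32 <= 40
n = qubits_needed(4)    # hypothetical N_a = 4 actions:   4 <= 2**2 = 4  <= 8
```

When the count is not a power of two, some basis states of the register have no corresponding state or action in the traditional RL problem; how these spare basis states are treated is not specified by (20) itself.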



We then use $m$ and $n$ qubits to represent the eigen state set $S = \{|s_i\rangle\}$ and the eigen action set $A = \{|a_j\rangle\}$ respectively, and obtain the corresponding relations as follows:

$$|s^{(N_s)}\rangle = \sum_{i=1}^{N_s} C_i |s_i\rangle \;\rightarrow\; |s^{(m)}\rangle = \sum_{s=00\cdots0}^{11\cdots1} C_s |s\rangle \tag{21}$$

$$|a^{(N_a)}\rangle = \sum_{j=1}^{N_a} C_j |a_j\rangle \;\rightarrow\; |a^{(n)}\rangle = \sum_{a=00\cdots0}^{11\cdots1} C_a |a\rangle \tag{22}$$

In other words, the states (or actions) of a QRL system may lie in the superposition state of eigen states (or eigen actions). The inequalities in (20) guarantee that every state and action in traditional RL has a corresponding representation with eigen states and eigen actions in QRL. The probability amplitudes $C_s$ and $C_a$ are complex numbers and satisfy

$$\sum_{s=00\cdots0}^{11\cdots1} |C_s|^2 = 1 \tag{23}$$

$$\sum_{a=00\cdots0}^{11\cdots1} |C_a|^2 = 1 \tag{24}$$

B. Action selection policy

In QRL, the agent is also to learn a policy $\pi : S \times \cup_{i \in S} A_{(i)} \to [0,1]$, which will maximize the expected sum of discounted reward of each state. That is to say, the mapping from states to actions is $\pi : S \to A$, and we have

$$f(s) = |a_s^{(n)}\rangle = \sum_{a=00\cdots0}^{11\cdots1} C_a |a\rangle \tag{25}$$

where the probability amplitude $C_a$ satisfies (24). Here, the action selection policy is based on the collapse postulate:

Definition 3 (Action collapse): When an action $|A\rangle = \sum_n \beta_n |a_n\rangle$ is measured, it will be changed and collapse randomly into one of its eigen actions $|a_n\rangle$ with the corresponding probability $|\langle a_n|A\rangle|^2$:

$$|\langle a_n|A\rangle|^2 = \big|(|a_n\rangle)^{*}|A\rangle\big|^2 = \Big|(|a_n\rangle)^{*} \sum_n \beta_n |a_n\rangle\Big|^2 = |\beta_n|^2 \tag{26}$$

Remark 4: According to Definition 3, when an action $|a_s^{(n)}\rangle$ in (25) is measured, we will get $|a\rangle$ with the occurrence probability of $|C_a|^2$. In the QRL algorithm, we will amplify the probability of a "good" action according to the corresponding rewards. It is obvious that the collapse action selection method is not a real action selection policy theoretically. It is just a fundamental phenomenon when a quantum state is measured, which results in a good balancing between exploration and exploitation and a natural "action selection" without setting parameters. More detailed discussion about the action selection can also be found in Ref. [45].

C. Paralleling state value updating

In Proposition 1, we pointed out that every possible state of QRL $|S\rangle$ can be expanded in terms of an orthogonal complete set of eigen states $|s_n\rangle$: $|S\rangle = \sum_n \alpha_n |s_n\rangle$. According to quantum parallelism, a certain unitary transformation $U$ on the qubits can be implemented. Suppose we have such an operation which can simultaneously process these $2^m$ states with the TD(0) value updating rule

$$V(s) \leftarrow V(s) + \alpha\big(r + \gamma V(s') - V(s)\big) \tag{27}$$

where $\alpha$ is the learning rate, and the meaning of reward $r$ and value function $V$ is the same as that in traditional RL. It is like parallel value updating of traditional RL over all states. However, it provides an exponential-scale computation space in the $m$-qubit linear physical space and can speed up the solutions of related functions. In this paper we will simulate the QRL process on a traditional computer in Section V. How to realize some specific functions of the algorithm using quantum gates in detail is our future work.


D. Probability amplitude updating 

In QRL, action selection is executed by measuring the action $|a_s^{(n)}\rangle$ related to a certain state, which will collapse to $|a\rangle$ with the occurrence probability of $|C_a|^2$. So there is no doubt that probability amplitude updating is the key to recording the "trial-and-error" experience and learning to be more intelligent.

As the action $|a_s^{(n)}\rangle$ is the superposition of $2^n$ possible eigen actions, finding $|a\rangle$ usually goes together with changing its probability amplitude for a quantum system. The updating of the probability amplitude is based on the Grover iteration [26]. First, prepare the equally weighted superposition of all eigen actions

$$|a_0^{(n)}\rangle = \frac{1}{\sqrt{2^n}} \sum_{a=00\cdots0}^{11\cdots1} |a\rangle \tag{28}$$

This process can be done easily by applying $n$ Hadamard gates in sequence to $n$ independent qubits, each initially in state $|0\rangle$ [26], which can be represented as:

$$H^{\otimes n}|00\cdots0\rangle = \frac{1}{\sqrt{2^n}} \sum_{a=00\cdots0}^{11\cdots1} |a\rangle \tag{29}$$

We know that $|a\rangle$ is an eigen action, irrespective of the value of $a$, so that

$$|\langle a|a_0^{(n)}\rangle| = \frac{1}{\sqrt{2^n}} \tag{30}$$

To construct the Grover iteration we will combine two reflections $U_a$ and $U_{a_0^{(n)}}$ [44]

$$U_a = I - 2|a\rangle\langle a| \tag{31}$$

$$U_{a_0^{(n)}} = H^{\otimes n}\big(2|0\rangle\langle 0| - I\big)H^{\otimes n} = 2|a_0^{(n)}\rangle\langle a_0^{(n)}| - I \tag{32}$$

where $I$ is the identity matrix with appropriate dimensions and $U_a$ corresponds to the oracle $O$ in the Grover algorithm [44]. The outer product $|a\rangle\langle a|$ is defined as $|a\rangle\langle a| = |a\rangle(|a\rangle)^{*}$. Obviously, we have

$$U_a|a\rangle = (I - 2|a\rangle\langle a|)|a\rangle = |a\rangle - 2|a\rangle = -|a\rangle \tag{33}$$

$$U_a|a^{\perp}\rangle = (I - 2|a\rangle\langle a|)|a^{\perp}\rangle = |a^{\perp}\rangle \tag{34}$$
Fig. 1. The schematic of a single Grover iteration. $U_a$ flips $|a_s\rangle$ into $|a'_s\rangle$ and $U_{a_0^{(n)}}$ flips $|a'_s\rangle$ into $|a''_s\rangle$. One Grover iteration $U_{Grov}$ rotates $|a_s\rangle$ by $2\theta$.

Fig. 2. The effect of Grover iterations in the Grover algorithm and QRL. (a) Initial state; (b) Grover iterations ($\mathrm{int}(\pi/4\theta)$ times) for amplifying $|C_a|^2$ to almost 1; (c) Grover iterations ($L$ times) for reinforcing action $|a\rangle$ to probability $\sin^2[(2L+1)\theta]$.



where $|a^{\perp}\rangle$ represents an arbitrary state orthogonal to $|a\rangle$. Hence $U_a$ flips the sign of the action $|a\rangle$, but acts trivially on any action orthogonal to $|a\rangle$. This transformation has a simple geometrical interpretation. Acting on any vector in the $2^n$-dimensional Hilbert space, $U_a$ reflects the vector about the hyperplane orthogonal to $|a\rangle$. Analogous to the analysis in the Grover algorithm, $U_a$ can be looked upon as a quantum black box, which can effectively judge whether the action is the "good" eigen action. Similarly, $U_{a_0^{(n)}}$ preserves $|a_0^{(n)}\rangle$, but flips the sign of any vector orthogonal to $|a_0^{(n)}\rangle$.

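Thus one Grover iteration is the unitary transformation [28], [44]

$$U_{Grov} = U_{a_0^{(n)}} U_a \tag{35}$$

Now let's consider how the Grover iteration acts in the plane spanned by $|a\rangle$ and $|a_0^{(n)}\rangle$. The initial action $|a_0^{(n)}\rangle$ can be re-expressed as

$$f(s) = |a_0^{(n)}\rangle = \frac{1}{\sqrt{2^n}}|a\rangle + \sqrt{\frac{2^n - 1}{2^n}}\,|a^{\perp}\rangle \tag{36}$$

Recall that

$$|\langle a_0^{(n)}|a\rangle| = \frac{1}{\sqrt{2^n}} \equiv \sin\theta \tag{37}$$

Thus

$$f(s) = |a_0^{(n)}\rangle = \sin\theta\,|a\rangle + \cos\theta\,|a^{\perp}\rangle \tag{38}$$

This procedure of the Grover iteration $U_{Grov}$ can be visualized geometrically by Fig. 1.

This figure shows that $|a_0^{(n)}\rangle$ is rotated by $\theta$ from the axis $|a^{\perp}\rangle$ normal to $|a\rangle$ in the plane. $U_a$ reflects a vector $|a_s\rangle$ in the plane about the axis $|a^{\perp}\rangle$ to $|a'_s\rangle$, and $U_{a_0^{(n)}}$ reflects the vector $|a'_s\rangle$ about the axis $|a_0^{(n)}\rangle$ to $|a''_s\rangle$. From Fig. 1 we know that

$$\frac{\alpha - \beta}{2} + \beta = \theta \tag{39}$$

where $\alpha$ and $\beta$ here denote angles in Fig. 1. Thus we have $\alpha + \beta = 2\theta$. So one Grover iteration $U_{Grov} = U_{a_0^{(n)}} U_a$ rotates any vector $|a_s\rangle$ by $2\theta$.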

We can now carry out a certain number of Grover iterations to update the probability amplitudes according to the respective rewards and value functions. It is obvious that $2\theta$ is the updating stepsize. Thus when an action $|a\rangle$ is executed, the probability amplitude of $|a_s^{(n)}\rangle$ is updated by carrying out $L = \mathrm{int}(k(r + V(s')))$ Grover iterations, where $\mathrm{int}(x)$ returns the integer part of $x$. $k$ is a parameter which indicates that the number $L$ of iterations is proportional to $r + V(s')$. The selection of its value is empirical in this paper and its optimization is an open question. The probability amplitudes will be normalized with $\sum_a |C_a|^2 = 1$ after each updating. According to Ref. [46], applying the Grover iteration $U_{Grov}$ for $L$ times on $|a_0^{(n)}\rangle$ can be represented as

$$U_{Grov}^{L}|a_0^{(n)}\rangle = \sin[(2L+1)\theta]\,|a\rangle + \cos[(2L+1)\theta]\,|a^{\perp}\rangle \tag{40}$$

Obviously, we can reinforce the action $|a\rangle$ from probability $\frac{1}{2^n}$ to $\sin^2[(2L+1)\theta]$ through Grover iterations. Since $\sin[(2L+1)\theta]$ is a periodic function of $(2L+1)\theta$ and too many iterations may also cause a small probability $\sin^2[(2L+1)\theta]$, we further select $L = \min\{\mathrm{int}(k(r + V(s'))),\ \mathrm{int}(\frac{\pi}{4\theta} - \frac{1}{2})\}$.
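The amplitude update can be checked numerically by simulating the two reflections on an explicit amplitude vector. The following sketch assumes a classical simulation with $2^n = 16$ eigen actions, a hypothetical "good" action index and hypothetical values of $k$, $r$ and $V(s')$; clipping $L$ at zero for non-positive $r + V(s')$ is also an assumption of this example, not something stated in the text.

```python
import numpy as np


def grover_iteration(amps, a):
    """Apply U_Grov = U_{a0} U_a to an amplitude vector over the eigen actions."""
    out = np.array(amps, dtype=complex)
    out[a] = -out[a]                  # U_a = I - 2|a><a| flips the sign of |a>
    return 2 * out.mean() - out       # U_{a0} = 2|a0><a0| - I reflects about |a0>


def num_iterations(k, r, v_next, theta):
    """L = min{int(k (r + V(s'))), int(pi/(4 theta) - 1/2)}, clipped at zero."""
    cap = int(np.pi / (4 * theta) - 0.5)
    return max(0, min(int(k * (r + v_next)), cap))


n = 4
N = 2 ** n                                          # number of eigen actions
theta = np.arcsin(1 / np.sqrt(N))                   # sin(theta) = 1/sqrt(2^n)
amps = np.full(N, 1 / np.sqrt(N), dtype=complex)    # |a0>, eq. (28)
a = 5                                               # hypothetical "good" eigen action
L = num_iterations(k=1.0, r=1.0, v_next=1.2, theta=theta)
for _ in range(L):
    amps = grover_iteration(amps, a)
prob_a = abs(amps[a]) ** 2                          # compare with sin^2((2L+1) theta)
```

Because both reflections are unitary, the amplitude vector stays normalized, and starting from the uniform state the probability of the selected action follows $\sin^2[(2L+1)\theta]$ as in (40).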

Remark 5: The probability amplitude updating is inspired 
by the Grover algorithm and the two procedures use the same 
amplitude amplification technique as a subroutine. Here we 
want to emphasize the difference between the probability 
amplitude updating and Grover's database searching algorithm.
The objective of the Grover algorithm is to search $|a\rangle$ by
amplifying its occurrence probability to almost 1, whereas
the aim of the probability amplitude updating process in QRL
is just to appropriately update (amplify or shrink) the corresponding
amplitudes for "good" or "bad" eigen actions. So the essential
difference is in the number $L$ of iterations, and this can be
demonstrated by Fig. 2.

E. QRL algorithm 

Based on the above discussion, the procedural form of 
a standard QRL algorithm is described as Fig. 3. In QRL 
algorithm, after initializing the state and action we can observe 
jai"'') and obtain an eigen action |a). Execute this action and 
the system can give out next state |s'), reward r and state value 
V{s'). V{s) is updated by TD(0) rule, and r and V{s') can 
be used to determine the iteration times L. To accomplish the 
task in a practical computing device, we require some basic 
registers for the storage of related information. Firstly two m- 
qubit registers are required for all eigen states and their state 
values V{s), respectively. Secondly every eigen state requires 
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Procednrnl QRL : 

Initialize |j^"'h= \s).f(s)=\a'f^}= VC'_, |„) a„d F(j) ai'bib-arily 



s=00-"0 

Repeat (for each episode) 

ii-.-i 

For all states | s) in | i'"^) = | s) : 

s=OD-"0 

1 . Observe f(s) =\ a^^} and get | a) ; 

2. Take action \a) , obseiTe next state | s'}, reward r , then 

(a) Update state value: V(s)^V(s) + a(r + rV(s')-V(sy) 

(b) Update probability amplitudes: 

repeat U^^^ for Z times 

Until for all states | AV(s) |< s . 



Fig. 3. The algorithm of a standard quantum reinforcement learning (QRL) 



two n-qubit registers so that its eigen actions are stored
twice: one n-qubit register stores the action $|a_s^{(n)}\rangle$ to be
observed, and the other n-qubit register stores the same action
to prevent the memory loss associated with the action collapse.
It is worth mentioning that this does not conflict with the
no-cloning theorem [44], since the action $|a_s^{(n)}\rangle$ is a certain
known state at each step. Finally, several simple classical
registers may be required for the reward $r$, the number of
iterations $L$, etc.
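
The procedure in Fig. 3 can be sketched for a classical simulation as follows. This is a minimal sketch under stated assumptions, not the authors' code: amplitudes are stored as real lists per state, `env_step` is a hypothetical environment callback returning the next state and reward, and the rule L = int(k (r + V(s'))) with a hypothetical scale factor `k` is only one possible way to derive the iteration count from r and V(s').

```python
import math
import random


def observe(amps, rng):
    """Collapse: sample an eigen action index with probability |C_a|^2."""
    probs = [c * c for c in amps]
    return rng.choices(range(len(amps)), weights=probs, k=1)[0]


def grover_iteration(amps, marked):
    """One Grover step: flip the marked amplitude, then invert about the mean."""
    out = list(amps)
    out[marked] = -out[marked]
    mean = sum(out) / len(out)
    return [2.0 * mean - c for c in out]


def qrl_step(state, V, amps, env_step, alpha, gamma, k, rng):
    """One pass of steps 1 and 2 of Fig. 3 for a single state."""
    action = observe(amps[state], rng)              # step 1: observe f(s)
    next_state, reward = env_step(state, action)    # step 2: act
    # (a) TD(0) update of the state value
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])
    # (b) amplitude update; this rule for L is an assumption of the sketch
    n_iter = max(0, int(k * (reward + V[next_state])))
    for _ in range(n_iter):
        amps[state] = grover_iteration(amps[state], action)
    return next_state, reward


if __name__ == "__main__":
    rng = random.Random(0)
    V = {0: 0.0, 1: 0.0}
    amps = {0: [1.0 / math.sqrt(8)] * 8}
    print(qrl_step(0, V, amps, lambda s, a: (1, 1.0), 0.1, 0.9, 1.0, rng))
    print(V[0], [round(c * c, 3) for c in amps[0]])
```

Because each Grover step is a reflection, the amplitude vector stays normalized, so the squared amplitudes can be used directly as observation probabilities at the next visit to the state.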

Remark 6: QRL is inspired by the superposition principle
of quantum states and by quantum parallelism. The action set
can be represented by a quantum state, and an eigen action can
be obtained by randomly observing the simulated quantum
state, which leads to state collapse according to the
quantum measurement postulate. The occurrence probability
of every eigen action is determined by its corresponding
probability amplitude, which is updated according to rewards
and value functions. This approach therefore represents the
whole state-action space with a superposition of quantum states
and makes a good tradeoff between exploration and exploitation
using probability.

Remark 7: The merit of QRL is twofold. First, as a simulation
algorithm on a traditional computer, it is an effective
algorithm with novel representation and computation methods.
Second, its representation and computation mode are consistent
with quantum parallelism and can speed up learning on
quantum computers or with quantum gates.

IV. Analysis of QRL 

In this section, we discuss some theoretical properties of
QRL algorithms and provide some advice from an engineering
point of view. Four major results are presented: (1) an
asymptotic convergence proposition for QRL algorithms, (2)
optimality and the stochastic nature of the algorithm, (3) a good
balance between exploration and exploitation, and (4) physical
realization. The following analysis indicates that QRL
performs much better than other methods when the
search space becomes very large.

A. Convergence of QRL 

In QRL we use temporal difference (TD) prediction for
state value updating, and the TD algorithm has been proved
to converge for absorbing Markov chains [4] when the learning
rate is nonnegative and decreasing. For a general statement of
the convergence of QRL, we have Proposition 2.

Proposition 2 (Convergence of QRL): For any Markov
chain, QRL algorithms converge to the optimal state value
function $V^*(s)$ with probability 1 under a proper exploration
policy when the following conditions hold (where $\alpha_k$ is
the learning rate and is nonnegative):

$$\lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k = \infty, \qquad \lim_{T \to \infty} \sum_{k=1}^{T} \alpha_k^2 < \infty \qquad (41)$$

Proof: (sketch) Based on the above analysis, QRL is a
stochastic iterative algorithm. Bertsekas and Tsitsiklis have
verified the convergence of stochastic iterative algorithms [3]
when (41) holds. In fact, many traditional RL algorithms have
been proved to be stochastic iterative algorithms [3], [4], [47],
and QRL is of the same kind as traditional RL. The main
differences are:

(1) The exploration policy is based on the collapse postulate of
quantum measurement when the action is observed;

(2) This kind of algorithm is carried out by quantum
parallelism, which means that all states are updated
simultaneously, so QRL is a synchronous learning algorithm.

So these modifications of RL do not affect the convergence
property, and the QRL algorithm converges when (41)
holds. ■
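
As a worked illustration of condition (41) (an example added here, not part of the proof), consider the schedule $\alpha_k = 1/k$, which satisfies both requirements, and a constant rate $\alpha_k = c > 0$, which violates the second one:

```latex
\sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad
\sum_{k=1}^{\infty} \frac{1}{k^{2}} = \frac{\pi^{2}}{6} < \infty;
\qquad \text{whereas} \qquad
\sum_{k=1}^{T} c^{2} = T c^{2} \to \infty \quad (T \to \infty).
```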

B. Optimality and stochastic algorithm 

Most quantum algorithms are stochastic algorithms which
give the correct decision with probability $1 - \varepsilon$
($\varepsilon > 0$, close to 0) after several repetitions of the
computation [23], [25]. As for quantum reinforcement learning
algorithms, optimal policies are acquired by the collapse of the
quantum system, and we analyze the optimality of these policies
from the following two aspects.

1) QRL implemented by real quantum apparatuses: When
QRL algorithms are implemented by real quantum apparatuses,
the agent's strategy is given by the collapse of the
corresponding quantum system according to the probability
amplitudes. QRL algorithms cannot guarantee the optimality of
every strategy, but they can give the optimal decision with
a probability approaching 1 by repeating the computation
several times. Suppose that the agent gives an optimal strategy
with probability $1 - \varepsilon$ after it has learned well
(the state value function has converged to $V^*(s)$). For
$\varepsilon \in (0,1)$, the error probability is $\varepsilon^d$
after $d$ repetitions. Hence the agent will give the optimal
strategy with probability $1 - \varepsilon^d$ by repeating the
computation $d$ times. The QRL algorithms on real quantum
apparatuses are still effective due to the powerful computing
capability of quantum systems. Our current work has focused on
simulating QRL algorithms on the traditional computer, and the
simulated algorithms also bear the characteristics
inspired by quantum systems.
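
The repetition argument above can be turned into a small calculation. The sketch below assumes, as the text does implicitly, that the d repetitions fail independently; the function names and the target error `delta` are hypothetical.

```python
def success_probability(eps, d):
    """Probability 1 - eps**d that at least one of d independent runs succeeds."""
    return 1.0 - eps ** d


def repetitions_needed(eps, delta):
    """Smallest d such that the error probability eps**d is at most delta."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    d, err = 1, eps
    while err > delta:
        err *= eps
        d += 1
    return d


if __name__ == "__main__":
    print(repetitions_needed(0.5, 0.01))
    print(success_probability(0.5, 7))
```

Since the error shrinks geometrically in d, even a single-run error as large as 0.5 falls below 0.01 within a handful of repetitions.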

2) Simulating QRL on the traditional computer: As mentioned
above, most of the work in this paper has been done to develop
this kind of novel QRL algorithm by simulation on the
traditional computer. In traditional RL theory, however,
researchers have argued that even if we have a complete and
accurate model of the environment's dynamics, it is usually not
possible to simply compute an optimal policy by solving the
Bellman optimality equation [1]. What about QRL? In QRL,
the optimal value functions and optimal policies are defined
in the same way as in traditional RL. The difference lies in the
representation and computing mode. The policy is probabilistic
rather than deterministic, using probability amplitudes, which
makes it more effective and safer. But it is still clear that
simulating QRL on the traditional computer cannot speed up
learning exponentially, since the quantum parallelism
is not really executed by a real physical system. Moreover,
when more powerful computation is available, the agent
will learn much better. We may then fall back on the physical
realization of quantum computation again.

C. Balancing between exploration and exploitation 

One widely used action selection scheme is $\varepsilon$-greedy [48],
[49], where the best action is selected with probability
$1 - \varepsilon$ and a random action is selected with probability
$\varepsilon$ ($\varepsilon \in (0,1)$). The exploration probability
$\varepsilon$ can be reduced over time, which moves the agent from
exploration to exploitation. The $\varepsilon$-greedy method is simple
and effective, but it has one drawback: when it explores, it
chooses equally among all actions. This means that it is as
likely to choose the worst action as the next-to-best action.
Another problem is that it is difficult to choose a proper
parameter $\varepsilon$ that offers an optimal balance between
exploration and exploitation.
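
A minimal sketch of this selection rule is given below. It assumes that action preferences are available as a list of estimated values `q_values` (a hypothetical input) and that the exploratory draw is uniform over all actions, including the greedy one.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index: uniform random with probability epsilon,
    otherwise the index of the largest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))   # explore: all actions equally likely
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [epsilon_greedy([0.1, 0.9, 0.3], 0.1, rng) for _ in range(1000)]
    print([picks.count(a) for a in range(3)])
```

The exploratory branch ignores the values entirely, which is the drawback noted above: the worst and the next-to-best action are drawn with the same probability.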

Another kind of action selection scheme is Boltzmann
exploration (including the Softmax action selection method) [1],
[48], [49]. It uses a positive parameter $\tau$ called the
temperature and chooses an action with probability proportional to
$e^{Q(s,a)/\tau}$. It can move from exploration to exploitation by
adjusting the "temperature" parameter $\tau$. It is natural to sample
actions according to this distribution, but it is very difficult
to set and adjust a good parameter $\tau$. There are also similar
problems with simulated annealing (SA) methods [50].
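
The effect of the temperature can be seen in a short sketch. It assumes a list of estimated action values `q_values` (a hypothetical input) and subtracts the maximum before exponentiating, a standard numerical safeguard that leaves the distribution unchanged.

```python
import math


def boltzmann_probabilities(q_values, tau):
    """Action probabilities proportional to exp(Q / tau) for temperature tau > 0."""
    if tau <= 0.0:
        raise ValueError("temperature must be positive")
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for tau in (100.0, 1.0, 0.1):
        print(tau, [round(p, 3) for p in boltzmann_probabilities([1.0, 2.0, 3.0], tau)])
```

A high temperature gives a nearly uniform distribution (exploration), while a low temperature concentrates almost all probability on the highest-valued action (exploitation), which is why the choice of $\tau$ is delicate.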

We have introduced the action selection strategy of QRL in
Section III, which is called the collapse action selection method.
The agent does not need to select a proper action consciously.
The action selection process is accomplished by the fundamental
phenomenon that an action represented by a quantum
superposition state naturally collapses to an eigen action when
it is measured. In the learning process, the agent can explore
more effectively since the state and action can lie in
superposition states through parallel updating. When an action
is observed, it will collapse to an eigen action with a certain
probability. Hence the QRL algorithm is essentially a kind of
probabilistic algorithm. However, it is greatly different from
classical probabilistic algorithms: in classical algorithms the
possible results always exclude each other, whereas in the QRL
algorithm many results can interfere with each other to
yield some global information through specific quantum
gates such as Hadamard gates. Compared with other exploration
strategies, this mechanism leads to a better balance between
exploration and exploitation.

In this paper, the simulated results will show that the
action selection method using the collapse phenomenon is very
effective. More importantly, it is consistent
with the physical quantum system, which makes it more
natural, and the mechanism of QRL has the potential to be
implemented by real quantum systems.

D. Physical realization 

As a quantum algorithm, QRL is also physically realizable,
since its two main operations are preparing the equally
weighted superposition state for initializing the quantum
system and carrying out a certain number of Grover iterations
for updating the probability amplitudes according to
rewards and value functions. These are the same operations
needed in the Grover algorithm, and they can be accomplished
using different combinations of Hadamard gates and phase
gates. So the physical realization of QRL presents no difficulty
in principle. Moreover, the experimental implementations of
the Grover algorithm also demonstrate the feasibility of the
physical realization of our QRL algorithm.

V. Experiments 

To evaluate the QRL algorithm in practice, consider the typical
gridworld example. The gridworld environment is shown in
Fig. 4, and each cell of the grid corresponds to an individual
state (eigen state) of the environment. From any state the
agent can perform one of four primary actions (eigen actions):
up, down, left and right; actions that would lead into a
blocked cell are not executed. The task of the algorithm is to
find an optimal policy which will let the agent move from the
start point S to the goal point G with minimized cost (or maximized
Fig. 4. The example is a gridworld environment with cell-to-cell actions
(up, down, left and right). The labels S and G indicate the initial state and
the goal in the simulated experiment described in the text.



rewards). An episode is defined as one run of the learning process
in which the agent moves from the start state to the goal state.
If the agent cannot find the goal state within a maximum number
of steps (or a period of time), the episode is terminated and
another episode starts from the start state. So when the
agent finds an optimal policy through learning, the number of
moving steps per episode reduces to its minimum.



A. Experimental set-up 

In this 20 x 20 (0 ~ 19) gridworld, the initial state S and the
goal state G are cell(1,1) and cell(18,18), respectively, and
before learning the agent has no information about the
environment at all. Once the agent finds the goal state it
receives a reward of r = 100 and the episode ends. All steps are
punished by a reward of r = -1. The discount factor $\gamma$ is set to 0.99
and all of the state values V(s) are initialized as V ~ 
for all the algorithms that we have carried out. In the first 
experiment, we compare the QRL algorithm with TD(0), and we
also show the theoretically expected result on a quantum
computer. In the second experiment, we give some
results of the QRL algorithm with different learning rates. For
the action selection policy of the TD algorithm, we use the
$\varepsilon$-greedy policy ($\varepsilon = 0.01$); that is to say, the
agent executes the "good" action with probability $1 - \varepsilon$
and chooses the other actions with equal probability. As for
QRL, the action selection policy is clearly different from that
of traditional RL algorithms, since it is
inspired by the collapse postulate of quantum measurement.
The value of $|C_a|^2$ is used to denote the probability of an
action defined as $f(s) = |a_s^{(n)}\rangle = \sum_{a=00\cdots0}^{11\cdots1} C_a |a\rangle$. For the
four cell-to-cell actions, i.e. the four eigen actions up, down, left
and right, $|C_a|^2$ is initialized uniformly.
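
The set-up above can be sketched as a small environment function. This is an illustrative sketch only: the obstacle layout of Fig. 4 is not recoverable from the text, so `blocked` is a hypothetical placeholder that defaults to an empty set. The sketch also assumes that moves off the grid are treated like moves into a blocked cell, and that the goal reward of 100 replaces the step penalty on the final move.

```python
SIZE = 20                      # cells are indexed 0..19 in each direction
START, GOAL = (1, 1), (18, 18)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action, blocked=frozenset()):
    """Return (next_state, reward, done) for one cell-to-cell move."""
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    off_grid = not (0 <= row < SIZE and 0 <= col < SIZE)
    if off_grid or (row, col) in blocked:
        row, col = state       # the move is not executed
    next_state = (row, col)
    if next_state == GOAL:
        return next_state, 100.0, True
    return next_state, -1.0, False


if __name__ == "__main__":
    print(step(START, "right"))
    print(step((18, 17), "right"))
```

The uniform initialization of $|C_a|^2$ over the four eigen actions corresponds to an amplitude of 1/2 for each of up, down, left and right.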



B. Experimental results and analysis 

Learning performance of the QRL algorithm compared with the
TD algorithm in traditional RL is plotted in Fig. 5, where
the cases with good performance are chosen for both
the QRL and TD algorithms. As shown in Fig. 5, the good
cases in this gridworld example are the TD algorithm
with a learning rate of $\alpha = 0.01$ and the QRL algorithm
with $\alpha = 0.06$. The horizontal axis represents the episode
in the learning process, and the vertical axis gives the number
of steps required in that episode. We
observe that the QRL algorithm is also effective on
the traditional computer, although it is inspired by quantum
mechanical systems and is designed for future quantum
computers. For their respective good cases in Fig. 5,
QRL explores more than the TD algorithm at the beginning of
the learning phase, but it learns much faster and achieves a better
balance between exploration and exploitation. In addition, it
is much easier to tune the parameters for QRL algorithms
than for traditional ones. If real quantum parallelism is
used, we can obtain the estimated theoretical results. More
importantly, according to the estimated theoretical results,
QRL has great computational potential provided
that a quantum computer (or related quantum apparatus)
becomes available in the future, which will lead to a more
effective approach to the existing problems of learning in
complex unknown environments.

Furthermore, in the following comparison experiments we
give the results of the TD(0) update in the QRL and RL algorithms
with different learning rates, respectively. Fig. 6 illustrates
the results of QRL algorithms with different learning rates $\alpha$
(alpha) from 0.01 to 0.11; to give a detailed description
of the learning process, we record every learning episode.
From these figures, it can be concluded that given a proper
learning rate (0.02 < alpha < 0.10) this algorithm learns fast
and explores much in the beginning phase, and then steadily
converges to the optimal policy, which costs 36 steps to reach
the goal state G. As the learning rate increases from 0.02 to 0.09,
this algorithm learns faster. When the learning rate is 0.01 or
smaller, it explores more but learns very slowly, so the learning
process converges very slowly. Compared with the result of
TD in Fig. 5, we find that the simulation result of QRL on
the classical computer does not show an advantage when the
learning rate is small (alpha < 0.01). On the other hand,
when the learning rate is 0.11 or above, it cannot converge to
the optimal policy because it oscillates with too large a learning
rate when the policy is near the optimal policy. Fig. 7 shows
the performance of the TD(0) algorithm, and we can see that the
learning process converges with a learning rate of 0.01. But
when the learning rate is larger (alpha = 0.02, 0.03 or larger),
it becomes very hard to make it converge to the optimal
policy within 10000 episodes. In any case, from Fig. 6 and Fig. 7
we can see that the range of learning rates for which the QRL
algorithm converges is much larger than that of the traditional
TD(0) algorithm.

All the results show that the QRL algorithm is effective and
outperforms traditional RL algorithms in the following three main
aspects: (1) The action selection policy makes a good tradeoff
between exploration and exploitation using probability, which
Fig. 5. Performance of QRL in the example of a gridworld environment compared with the TD algorithm ($\varepsilon$-greedy policy) for their respective good cases.
The expected theoretical result on a quantum computer is also shown.



speeds up learning and guarantees a search over the
whole state-action space as well. (2) The representation is based
on the superposition principle of quantum mechanics and the
updating process is carried out through quantum parallelism,
which will be much more prominent in the future when
practical quantum apparatus comes into use instead of being
simulated on traditional computers. (3) Compared with
the experimental results in Ref. [51], where the simulation
environment is a 13 x 13 (0 ~ 12) gridworld, we can see that
as the state space gets larger, the performance of QRL
improves relative to traditional RL in simulated experiments.

VI. Discussion 

The key contribution of this paper is a novel reinforcement
learning framework called quantum reinforcement learning
that integrates the characteristics of quantum mechanics with
reinforcement learning theories. In this section some associated
problems of QRL on the traditional computer are discussed
and some important future work is pointed out.

Although there is a long way to go before systems as
complicated as QRL can be implemented by physical quantum systems,
the simulated version of QRL on the traditional computer has
been shown to be effective and also outperforms standard RL
methods in several aspects. To improve this approach, some issues
for future work that we deem important are laid out as follows.



• Model of environments An appropriate model of the
environment will make problem-solving much easier and
more efficient. This is true for most RL algorithms.
However, modeling environments both accurately and simply
involves a tradeoff. As for QRL, this problem should
be considered slightly differently due to some of its
special features.

• Representations The representations used by the QRL
algorithm for different kinds of problems are naturally
of interest when a learning system is designed.
In this paper, we mainly discuss problems with discrete
states and actions, and a natural question is how to extend
QRL effectively to problems with continuous states and
actions.

• Function approximation and generalization Generalization
is necessary for RL systems to be applied to
artificial intelligence and most engineering applications.
Function approximation is an important approach to
achieving generalization. As for QRL, this issue will be
a rather challenging task, and function approximation
should be considered together with the special computation
mode of QRL.

• Theory QRL is a new learning framework that is different
from standard RL in several aspects, such as representation,
action selection, exploration policy, updating style,
Fig. 6. Comparison of QRL algorithms with different learning rates (alpha = 0.01 ~ 0.11).

Fig. 7. Comparison of TD(0) algorithms with different learning rates (alpha = 0.01, 0.02, 0.03).



etc. So there is a lot of theoretical work to do to take full
advantage of it, especially to analyze the complexity of
the QRL algorithm and improve its representation and
computation.

• More applications Besides more theoretical research,
QRL algorithms need to be applied to a range of problems
in order to test and improve this kind of learning
algorithm, especially in unknown probabilistic
environments and large learning spaces.

We strongly believe that QRL approaches and
related techniques will be promising for agent learning in
large-scale unknown environments. This new idea of applying
quantum characteristics will also inspire research in the
area of machine learning.

VII. Concluding Remarks 

In this paper, QRL is proposed based on the concepts
and theories of quantum computation in the light of
existing problems in RL algorithms such as the tradeoff between
exploration and exploitation and low learning speed. Inspired
by the state superposition principle, we introduce a framework of
value updating algorithm. The state (action) in traditional RL is
regarded as the eigen state (eigen action) in QRL. The state
(action) set can be represented by a quantum superposition
state, and the eigen state (eigen action) can be obtained by
randomly observing the simulated quantum state according to the
collapse postulate of quantum measurement. The probability
of an eigen state (eigen action) is determined by the probability
amplitude, which is updated according to rewards and value
functions. So it makes a good tradeoff between exploration
and exploitation and can speed up learning as well. At the
same time this novel idea will promote related theoretical and
technical research.

On the theoretical side, it gives us more inspiration to
look for new paradigms of machine learning to acquire better
performance. It also introduces the latest developments of
fundamental sciences, such as physics and mathematics, to the
area of artificial intelligence and promotes the development
of those subjects as well. In particular, the representation and
essence of quantum computation are different from those of
classical computation, and many aspects of quantum computation
are likely to evolve. Sooner or later machine learning will also
be profoundly influenced by quantum computation theory. We
have demonstrated the applicability of quantum computation
to machine learning, and more interesting results are expected
in the near future.

On the technical side, the results of simulated experiments
demonstrate the feasibility of this algorithm and show its
superiority for learning problems with huge state spaces
in unknown probabilistic environments. With the progress of
quantum technology, some fundamental quantum operations
are being realized via nuclear magnetic resonance, quantum
optics, cavity-QED and ion traps. Since the physical realization
of QRL mainly needs Hadamard gates and phase gates, both of
which are relatively easy to implement in quantum
computation, our work also presents a new task of implementing
QRL using practical quantum systems for quantum computation
and will simultaneously promote related experimental
research [51]. Once QRL becomes realizable on real physical
systems, it can be effectively applied to quantum robot learning
for accomplishing some significant tasks [52], [53].

Quantum computation and machine learning are both
studies of information processing tasks. The two research
fields have grown so rapidly that they have given birth to the
combination of traditional learning algorithms and quantum
computation methods, which will influence representation and
learning mechanisms, and many difficult problems could be solved
appropriately in a new way. Moreover, this idea also pioneers a
new field for quantum computation and artificial intelligence
[52], [53], and some efficient applications or hidden advantages
of quantum computation may be approached from
the angle of learning and intelligence.